X-SystemInfo: SWIPnet: comp.binaries.cbm
X-Message-No: 31 (database)
From: c.b.c Monthly Posting <cbm@gac.edu>
Subject: BCODE/UNBCODE (Part 0/2) .c UNIX
Date: Wed, 6 Dec 95 21:56:00
Message-ID: <4a53ie$ghh@lunen.gac.edu>
Reply-To: cbm-request@gac.edu (cbm-request)
Path: mn5.swip.net!seunet!news2.swip.net!mn6.swip.net!newsfeed.sunet.se!news01.sunet.se!sunic!news99.sunet.se!newsfeed.tip.net!news.seinf.abb.se!nooft.abb.no!Norway.EU.net!EU.net!howland.reston.ans.net!newsfeed.internetmci.com!mr.net!news.mr.net!lunen.gac.edu!news
Newsgroups: comp.binaries.cbm
Organization: Gustavus Adolphus College
Approved: mmiller3@gac.edu (comp.binaries.cbm)
NNTP-Posting-Host: mmiller3@zariski.gac.edu
BCODE/UNBCODE DOCUMENTATION for Version 2.00 [December 5, 1994]
1. INTRODUCTION
This is the documentation for the bcode and unbcode C programs. These
programs allow you to encode binary data into a text format that can be
e-mailed or posted to USENET newsgroups. Functionally, they are quite similar
to the uuencode/uudecode standard Unix utilities, except that these programs
can use three different encoding formats (including "uucode") and allow for
the automatic splitting, reordering, and incremental reassembly of multiple
file segments for long files, and provide CRC-32 and size error checking on
each encoded segment. In other words, these are "better-mousetrap" versions
of the uucode utilities.
Other Unix utilities give you some combination of these features for uuencoded
data, but, as far as I am personally able to determine, all other approaches
(including the "e" feature of "rn") are fundamentally flawed. Other
approaches attempt to parse the natural language of the subject lines and the
text file contents to figure out the file/segment information, but this is an
AI-complete problem and the results are hit-and-miss. The approach used here
alters the standard uucode format a little to give filename and segment number
data in a simple and consistent manner.
There is a catch, however. You can only extract multi-segment data that was
encoded using the encoding formats defined herein. Until the people on the
other end use the encoding formats these programs give (dare to dream), the
decoding program here is fundamentally no more powerful than the standard Unix
uudecode program (though better-done, IMHO).
The C code is in ANSI-C format. Both the "bcode.c" and "unbcode.c" source
files contain complete programs, so there is no need for a Makefile. Just
compile and run. The programs make no major system-specific assumptions, with
the exception that the decoder uses the system calls "rename" and "remove" for
files. On some systems, these will be "link" and "unlink". I also found that
the C++ compiler I tried wanted to have "<sysent.h>" included in order to use
its "link" and "unlink" calls.
2. ENCODING/DECODING FORMATS
2.1. NUCODE and UUCODE
The bcode and unbcode programs support three encoding formats: NUCODE, BCODE,
and HEXCODE. The NUCODE format is both upward and downward compatible
with the old, problematic, and incomplete UUCODE format that is very popular
out here in cyberspace. Here is what the NUCODE format looks like:
-nucode-begin 2 quote2
>+3$I)W,@:6YT;R!I="XB("T@0W)A:6<@0G)U8V4*
`
end
-nucode-end 2 30 efb8f0c5
dum de dum... random separation between the segments...
-nucode-begin 1 quote2
begin 640 quote2
M(DEF+"!A9G1E<B!E>'1E;G-I=F4@='=E86MI;F<L('EO=7(@<')O9W)A;2!I
M<R!S=&EL;"!T;V\@<VQO=RP@=')Y(&1R;W!P:6YG(&$*(&9E=R`G<VQE97`H
-nucode-continued 1 90 fdcdb3d3
Here, the encoded file "quote2" is encoded into two different segments and
the second segment is given first. Try running this through the decoder
and the encoded file will be extracted correctly.
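To make the body format concrete, here is a minimal Python sketch (not taken from the bcode sources; the function name "uu_decode_line" is made up for illustration) that decodes one uucode-style body line. It assumes the usual uuencode rules: the first character gives the number of raw bytes on the line, and every following group of four characters carries three bytes, six bits per character.

```python
def uu_decode_line(line):
    """Decode one uuencoded body line into raw bytes."""
    def val(ch):
        # Each character carries six bits; "`" (96) and " " (32) both map to 0.
        return (ord(ch) - 32) & 0x3F

    count = val(line[0])      # number of raw bytes encoded on this line
    body = line[1:]
    out = bytearray()
    for i in range(0, len(body) - 3, 4):
        a, b, c, d = (val(ch) for ch in body[i:i + 4])
        out += bytes([(a << 2 | b >> 4) & 0xFF,
                      (b << 4 | c >> 2) & 0xFF,
                      (c << 6 | d) & 0xFF])
    return bytes(out[:count])


# The single body line of segment 2 in the example above.
line = '>+3$I)W,@:6YT;R!I="XB("T@0W)A:6<@0G)U8V4*'
data = uu_decode_line(line)
print(len(data), data)
```

The decoded length is 30, which agrees with the size field on the "-nucode-end 2 30 ..." line.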
The first line of segment #1 of the data begins with "-nucode-begin", which is
much less likely to appear in regular text than the uucode control token
"begin". The token is followed by the segment number and filename. For
downward compatibility, the standard 'begin mode filename' line is also given
so that a standard uudecode utility can decode NUCODEd data (if the data is
encoded into only a single segment, which it is not in the example). The body
of the encoded data is identical to the UUCODE format, and includes the "`"
and "end" lines in the final segment. The final line of the final segment of
NUCODEd data has the "-nucode-end" token, the segment number, the segment size
in bytes, and a CRC-32 error checking value in hexadecimal. The CRC algorithm
used here is the same as the one used by PKZIP and ZMODEM, and a table-driven
implementation is used, so calculating the CRC costs little more than
computing a simple checksum.
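For readers who have not seen a table-driven CRC, the following Python sketch shows the general technique for the PKZIP-style CRC-32 (reflected polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF). It is an illustration only, not the code from bcode.c, and it has not been checked against the CRC values printed in the examples; whether the programs use exactly these start and finish conventions is an assumption.

```python
import zlib

def make_crc_table():
    """Build the 256-entry lookup table for polynomial 0xEDB88320."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

CRC_TABLE = make_crc_table()

def crc32(data, crc=0):
    """Per byte: one table lookup, one shift, and two XORs."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

# "123456789" is the conventional test string; its CRC-32 is cbf43926.
print("%08x" % crc32(b"123456789"))
# The result agrees with the CRC-32 in Python's zlib module.
print(crc32(b"hello") == zlib.crc32(b"hello"))
```

Passing the previous result back in as the second argument continues the calculation, so a file can be checked in a single pass without holding it all in memory.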
If a given segment is not the final segment of a file, then the control token
on the last line will be "-nucode-continued", indicating that the decoding
program should be on the lookout for more segments belonging to the file.
This method of encoding allows all files to be encoded in a single pass. Oh,
and of course, the decoder can handle multiple discontiguous file segments
encoded in different formats in the same input file or stream.
In previous versions of this utility, all of the control tokens had a
two-hyphen prefix, which, IMHO, looks better than one, but it was pointed out
that using two hyphens at the start of a line can be misinterpreted by
anonymous mailing and posting services as being the beginning of a signature
block, and they will erase the line starting with the two hyphens and every
line following it to help retain the anonymity of the sender. This is fine
and dandy, but stripping the body of a bcoded message would not be a good
thing, so the new standard for bcoded data is for all control tokens to have
one hyphen, and the use of two hyphens has been relegated to the status of
being a "hysterical raisin". However, the decoder will, of course, still
accept control tokens with two hyphens and the encoder will generate output
with two-hyphen control tokens if you ask it to (with the -2 option). This
applies to all encoding formats supported.
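As an illustration of how little parsing the control lines need, here is a small Python sketch that recognizes the begin, continued, and end tokens with either one or two leading hyphens. The regular expressions are inferred from the sample lines in this document only (for instance, an eight-digit hexadecimal CRC field is assumed), and "parse_control" is a made-up name, not something from the unbcode source.

```python
import re

# "-nucode-begin 1 quote2" or "--nucode-begin 1 quote2"
BEGIN_RE = re.compile(r"^--?(nucode|bcode|hexcode)-begin (\d+) (\S.*)$")
# "-nucode-end 2 30 efb8f0c5" or "-nucode-continued 1 90 fdcdb3d3"
TRAILER_RE = re.compile(
    r"^--?(nucode|bcode|hexcode)-(continued|end) (\d+) (\d+) ([0-9a-fA-F]{8})$"
)

def parse_control(line):
    """Return a dict describing a control line, or None for any other line."""
    line = line.rstrip("\r\n")
    m = BEGIN_RE.match(line)
    if m:
        return {"format": m.group(1), "kind": "begin",
                "segment": int(m.group(2)), "filename": m.group(3)}
    m = TRAILER_RE.match(line)
    if m:
        return {"format": m.group(1), "kind": m.group(2),
                "segment": int(m.group(3)), "size": int(m.group(4)),
                "crc": int(m.group(5), 16)}
    return None

print(parse_control("-nucode-end 2 30 efb8f0c5"))
print(parse_control("--bcode-begin 1 quote1"))
print(parse_control("begin 640 quote2"))   # plain uucode line, not a token
```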
2.2. BCODE
The BCODE format, for which these utilities were originally written and named,
looks like the following:
-bcode-begin 1 quote1
IklmLCBhZnRlciBleHRlbnNpdmUgdHdlYWtpbmcsIHlvdXIgcHJvZ3JhbSBpcyBzdGlsbCB0
b28gc2xvdywgdHJ5IGRyb3BwaW5nIGEKIGZldyAnc2xlZXAoLTEpJ3MgaW50byBpdC4iIC0g
Q3JhaWcgQnJ1Y2UK
-bcode-end 1 120 44fefcc6
The control lines are basically the same as for the NUCODE format (sans ugly
backward-compatibility garbage) and the body is identical to what the BASE-64
encoding used with MIME produces. The difference between
BCODE and MIME is that BCODE uses much simpler control information. The two
advantages of this format over NUCODE are that (1) the encoding is slightly
more efficient in that you don't need the data-length character at the start
of every line and the standard line length is a little longer, and (2) that
no ASCII characters are used that can be easily misinterpreted or mangled in
conversions to other character coding schemes.
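Because the body is plain BASE-64, any standard decoder can recover the data once the control lines are stripped. A short Python sketch using the standard "base64" module on the sample above (the variable names are illustrative only):

```python
import base64

# The three body lines between "-bcode-begin" and "-bcode-end" above.
body_lines = [
    "IklmLCBhZnRlciBleHRlbnNpdmUgdHdlYWtpbmcsIHlvdXIgcHJvZ3JhbSBpcyBzdGlsbCB0",
    "b28gc2xvdywgdHJ5IGRyb3BwaW5nIGEKIGZldyAnc2xlZXAoLTEpJ3MgaW50byBpdC4iIC0g",
    "Q3JhaWcgQnJ1Y2UK",
]

# Each full line in the sample is 72 characters, i.e. 54 raw bytes.
data = b"".join(base64.b64decode(line) for line in body_lines)
print(len(data))
print(data.decode("ascii"))
```

The decoded length is 120 bytes, matching the size field on the "-bcode-end" line. In the samples, a full BCODE line spends 72 characters on 54 bytes, while a full NUCODE line spends 61 characters on 45 bytes, which is where the slight efficiency gain comes from.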
2.3. HEXCODE
The HEXCODE format is a very simple hexadecimal format that can be used to
visually inspect a file, for downloading over a particularly unreliable
connection, or for bootstrapping purposes. HEXCODE looks like the following:
-hexcode-begin 1 quote3
000000:2249662c20616674657220657874656e7369766520747765616b696e672c2079:69
000020:6f75722070726f6772616d206973207374696c6c20746f6f20736c6f772c2074:c9
000040:72792064726f7070696e6720610a206665772027736c656570282d3129277320:24
000060:696e746f2069742e22202d2043726169672042727563650a:75
-hexcode-end 1 120 44fefcc6
Each line includes the hexadecimal file position and a simple 8-bit add-up
checksum. A simple decoder program can easily be written for bootstrapping
yourself if you are unable to use the C-language UNBCODE program on the
target platform.
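A bootstrap decoder along those lines might look like the following Python sketch. The checksum rule used here (the sum of the data bytes on the line, modulo 256, not counting the file position) is inferred from the sample lines above rather than taken from the source code, and "decode_hex_line" is a made-up name.

```python
def decode_hex_line(line):
    """Return (offset, data) for one 'offset:hexbytes:checksum' line."""
    offset_hex, data_hex, sum_hex = line.strip().split(":")
    data = bytes.fromhex(data_hex)
    # Assumed rule: checksum = sum of the data bytes, modulo 256.
    if sum(data) % 256 != int(sum_hex, 16):
        raise ValueError("checksum mismatch at offset " + offset_hex)
    return int(offset_hex, 16), data


# The last body line of the sample above.
offset, data = decode_hex_line(
    "000060:696e746f2069742e22202d2043726169672042727563650a:75"
)
print(offset, data)
```

A complete decoder would skip the control lines, call this for each body line, and write each chunk at the given file position (or simply append, if the lines arrive in order).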
3. BCODE PROGRAM
The usage of the BCODE program is as follows:
bcode [-vbuh12] [-l max_lines] [-p pref] [[[filename][-a encoding_alias]] ...]
The "-v" flag activates "verbose" mode, in which the program reports when it
opens a file for input or output.
The "-b", "-u", and "-h" flags specify that you wish to encode in bcode,
nucode, or hexcode, respectively. The default if none of these flags are used
is defined in the source code by the "DEFAULT_FORMAT" label. The factory-set
default for this label is UUCODE. The other possible values are BCODE and
HEXCODE.
The "-1" and "-2" flags tell the encoder whether to produce control tokens
with one or two hyphens. The default is the new standard, which is for one
hyphen to be produced, but you can change the default by setting the
"DEFAULT_TOKHYPH" label to either 1 or 2. The decoder can, of course, handle
either option. The reasoning behind these flags was described in a previous
section.
The "-l" flag and value allow you to specify the maximum number of encoded
lines to include in each segment of the encoded data. When this flag is used,
output is sent to special output files rather than to stdout (where it is
usually sent). One segment is sent to each special output file. These
special output files are named after the file being encoded, appended with a
".u" followed by the at-least-two-digit segment number, for the nucode format.
For example,
bcode -l 1000 junkfile
would put the bcoded segment data into "junkfile.u01", "junkfile.u02", ...,
"junkfile.u99", "junkfile.u100", etc. Each line of nucoded data contains 63
characters, counting the CR and LF at the end of the line (the 61 printable
characters represent 45 raw data bytes), so 1000 lines will produce
63000 bytes of output, which is
a good size for posting or for mailing to brain-damaged mailers (under 64K),
with a little extra text at the top. The max_lines value does not include the
control lines in the encoding format.
For the BCODE format, the special filenames are appended with a ".b" and the
segment number, and for HEXCODE, a ".h" and segment number. If you define the
"MESS_DOS" label when compiling, the special filenames will have the forms
"bcNNN.bco", "uuNNN.uue", and "hexNNN.hex" for bcode, nucode, and hexcode
formats, respectively. These names are compatible with the brain-damaged
MS-DOS filename format.
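The arithmetic and the Unix-style naming scheme can be sketched in a few lines of Python. This is only an estimate under stated assumptions: the bytes-per-line figures (45, 54, and 32) are read off the sample lines in this document, control lines are not counted, and "segment_names" is a hypothetical helper, not part of bcode.

```python
# format -> (filename suffix letter, raw bytes per full encoded line)
FORMATS = {"nucode": ("u", 45), "bcode": ("b", 54), "hexcode": ("h", 32)}

def segment_names(filename, file_size, max_lines, fmt="nucode"):
    """Estimate the output filenames for 'bcode -l max_lines filename'."""
    suffix, bytes_per_line = FORMATS[fmt]
    total_lines = (file_size + bytes_per_line - 1) // bytes_per_line
    segments = max(1, (total_lines + max_lines - 1) // max_lines)
    # "%02d" pads to at least two digits: 01 ... 99, 100, ...
    return ["%s.%s%02d" % (filename, suffix, n)
            for n in range(1, segments + 1)]

# A 100000-byte file needs 2223 nucode lines, hence three 1000-line segments.
print(segment_names("junkfile", 100000, 1000))
# 1000 lines of 63 bytes each stay under 64K.
print(1000 * 63, 1000 * 63 < 64 * 1024)
```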
The "-p" flag allows you to give a filename prefix for the encoded-output
files produced by the "-l" option, so that you can, for example, have the
output files go into a different directory. Since this is a prefix rather
than a directory name, you have to include, on a Unix system, the extra "/"
at the end of the directory name, as in "/tmp/mydir/". The prefix argument
follows the "-p" flag.
If you include filenames on the command line, then input will be taken from
them in turn (otherwise, input is taken from stdin and labelled "stdin"). If
there is a "-a" flag following a filename, then the file is labelled as the
encoding_alias following the "-a" flag in the nucode/whatever control
information. You may include many filenames (and associated aliases) on a
command line to create a nucode/whatever "archive". You may use a "-a" flag
on a command with no filenames to give your own name to the stdin stream.
4. UNBCODE PROGRAM
The usage of the UNBCODE program is as follows:
unbcode [-ivdnf] [-p prefix] [filename ...]
The "-i", "-v", and "-d" flags are used to request different levels of
operational information: informative, verbose, and debugging, respectively.
Informative messages include when a file is completely pieced back together,
verbose information includes when a file is opened or closed, and debugging
information includes a dump of the internal "fragment" table that keeps track
of which segments of which files the decoder currently has decoded. To keep
our terminology straight, a "fragment" consists of one or more file
"segments". All of this information is sent to the "stderr" file stream.
The "-n" flag is similar to the "-i" flag, except that only the filenames are
spit out for the files reassembled, and the names are sent to the "stdout"
file stream. The feature is intended to make this program inter-usable with
other programs.
The "-f" flag tells the decoder to forcibly accept a segment that has some
kind of error in it. Normally, when a segment is found to contain an error
of any type it is simply discarded and not used. This works well with the
incremental-operation of the decoding process, which allows you to decode
different segments of a single input file on multiple runs of the decoder
program (i.e., you would get yourself a valid copy of the segment in question
and rerun the decoder on it later). However, if you wish to take whatever
garbled data comes out of a corrupted segment, you can give the "-f" flag.
All syntax errors (e.g., bad character) in a segment are dealt with by
completely ignoring the line on which the error occurred (perhaps not ideal),
and error-check errors are simply ignored. Error messages are generated but
the decoded segment is kept as if it were valid.
The "-p" flag allows you to give a filename prefix to use for the temporary
files that are generated during the decoding process. Temporary files are
discussed fully below. If the "-p" flag is not used, then the temporary
files are maintained in the current directory, and if the "-p" flag is used,
the temporary files are given whatever prefix you specify. Normally, you
would use this prefix feature to make the temporary files appear in a separate
directory from the current directory (which is where the final, reassembled
files will go). Since this is a prefix, you must remember to put the final
"/" on Unix directory names. On a Unix system, it is recommended that you
use something like "/tmp/mydir/" rather than just "/tmp/" to avoid name
collisions with other users who might be decoding stuff at the same time.
There may not be much savings in space for keeping temp files in a separate
directory, since the amount of data on hand will generally not exceed the full
size of all the extracted data, because of the way that the temporary files
are handled. In other words, the usage of temporary storage is quite
efficient. There will be no savings in time for using a temp prefix, since
the final files must be copied to the current directory after being extracted,
whereas they are simply renamed if temps are kept in the current directory.
The only real advantage is that, on some systems, the final order of extracted
files in a directory will not get mish-mashed (which would only happen if the
encoded data were horribly out of order, requiring the creation of lots of
temporary files).
Any number of filenames may be given on the command line, and stdin is used if
no names are given. The program will do a one-pass sweep of all input files,
so your system need not support random file accessing.
Intermediate segments are decoded immediately and placed into temporary files
in the current directory (or prefix "directory") named like "0BC00001", with
different numbers. These files are created and deleted as needed. Between
runs, if there are any files that have not yet been completely pieced
together, the "fragment" information is saved into "0BC-STAT", which can be
listed to see what is in the temporary files and which segments of the files
are missing. If you use the prefix option, make sure that you use the same
prefix between multiple runs if the runs are incomplete and leave this stat
file lying around. An example of this file's contents could be:
00001-00001 beg 0000043200 0BC00002 filea
00001-00001 beg 0000003264 0BC00004 fileb
00003-00003 mid 0000000667 0BC00001 fileb
00005-00006 end 0000074586 0BC00003 fileb
The first two columns with the dash between indicate the range of segment
numbers that are contained in the temporary file. The next column gives the
interpretation of the temporary file, indicating if it is the beginning,
middle, or the end of the complete file being decoded. The next column gives
the number of valid bytes in the decoded fragment, and the next gives the name
of the temporary file, and the final column gives the name of the file that
the segment belongs to.
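Since the stat file is plain text with whitespace-separated columns, it is easy to post-process. The Python sketch below reads lines in the layout shown above and reports, for each file, which segment numbers are still missing and whether the final segment has been seen. It relies only on the column meanings just described; "missing_segments" is a made-up helper, not part of unbcode.

```python
def missing_segments(stat_text):
    """Map each filename to (missing segment numbers, end_seen)."""
    have = {}        # filename -> set of segment numbers on hand
    last_seg = {}    # filename -> final segment number, once "end" is seen
    for line in stat_text.splitlines():
        if not line.strip():
            continue
        seg_range, kind, _nbytes, _tempname, filename = line.split(None, 4)
        first, last = (int(n) for n in seg_range.split("-"))
        have.setdefault(filename, set()).update(range(first, last + 1))
        if kind == "end":
            last_seg[filename] = last
    report = {}
    for filename, segs in have.items():
        upper = last_seg.get(filename, max(segs))
        gaps = [n for n in range(1, upper + 1) if n not in segs]
        report[filename] = (gaps, filename in last_seg)
    return report


STAT = """\
00001-00001 beg 0000043200 0BC00002 filea
00001-00001 beg 0000003264 0BC00004 fileb
00003-00003 mid 0000000667 0BC00001 fileb
00005-00006 end 0000074586 0BC00003 fileb
"""
print(missing_segments(STAT))
```

For the example, "fileb" is missing segments 2 and 4, while "filea" has no gaps so far but its end has not been seen, so an unknown number of later segments is still outstanding.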
The fact that the status of decoding is kept between runs means that you don't
have to have all of the segments of the final file(s) present at any one run.
This would be useful, for example, if you were reading a binaries newsgroup
that posts in a supported format and you come across a posting that has
multiple file segments in multiple articles. Rather than saving all of the
pieces into a file and exiting the news reader to decode, you could use the
"save to pipe" feature of most news readers. You would enter something like:
s | unbcode -i
(or have it programmed into a copy buffer to be auto-entered at the touch of
a mouse button) and the segment data would be interpreted and the decoder
would inform you when it has successfully patched together a complete binary
file. This approach works not only if the pieces of a posting are out
of order, but also if multiple postings that you want are mish-mashed
together. A grander attack, again if all postings are in a supported
multi-segment format, might be:
53453-56209s | unbcode -i
If you get into trouble with a lot of garbage 0BC* files being left around
after a really screwed up decode attempt, you can simply delete all of the
0BC* files to wipe the slate clean.
This program makes the following assumptions about its execution environment:
we have sequential access to disk files, the append file mode is available,
and the file rename and remove operations are available. It also assumes the
compiler is able to assign structures.
5. CONCLUDING STUFF
These programs are Public Domain Software. You may use them and distribute
them freely, write filters that use the various formats, or rip into the code
and extract the guts for your own purposes. All that I ask if you modify the
code is that you leave my name in to show where the code came from and add
your name to take the heat off me if your modifications have bugs.
The files are also available via anonymous FTP from "ccnga.uwaterloo.ca" in
directory "/pub/cbm/unix" or from the World-Wide Web at the URL
"http://ccnga.uwaterloo.ca/~csbruce/unix.html".
If you have any comments, questions or suggestions, you can contact me at
the following e-mail address. Oh, there is no warranty of any kind here,
so if use of this program causes your company to lose millions of dollars,
tough noogies.
-Craig Bruce
csbruce@ccnga.uwaterloo.ca
"Proposed standard unit of data storage: the 'Virtual Tree', equivalent to
120.8 megabytes."